fix: memory leak in the webhook TLS healthcheck #2690
- The resp.Body was never closed, leaking one connection per execution.
- Create a new transport based on the default transport so that it inherits the default timeouts. Most importantly, this ensures that there is a default TLSHandshakeTimeout (10s) and dial timeout (30s).
- Disable HTTP keep-alives to avoid reusing the same HTTP connection. Otherwise, the healthcheck may fail when the cert is rotated (the new cert on disk wouldn't match the cert attached to the reused connection opened earlier).

Fix open-policy-agent#2654

Signed-off-by: Thibault Deutsch <thibault@arista.com>
acpana
left a comment
(first, thanks for all the engagement on the issue and for opening a PR 💯 )
quick question re keep alives;
// disable keep alives to ensure that http connection aren't reused, otherwise the check may
// fail if the cert was rotated in between
tr.DisableKeepAlives = true
would keep alives here just cause some network flakiness? iow, if checks retry once the certs have been rotated, would this still be a problem?
I don't think it would be just flakiness. I didn't get the time to validate my theory, but here is the scenario I have in mind (with keepalive enabled):
- The first health check passes, with certificate A. Because we drained the body before closing it, the connection is kept open (that's the default behaviour of the Go HTTP client, as long as the server also allows keepalive).
- Then the certificate is renewed. Certificate A is replaced by certificate B on disk, and the transport layer of the HTTP server now uses certificate B for new connections.
- A new health check starts. The previous connection is still open in the connection pool. The Go HTTP client selects it and makes a new request. We get a response. When we check the response, we compare what we have on disk (certificate B) with the peer cert associated with the connection (certificate A). The check fails.
- This repeats indefinitely until the connection is closed, which may never happen (or only after X minutes, when the maximum connection lifetime is exceeded), because we always drain and close the body properly, so the HTTP client should always keep the connection open.
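The reuse behaviour this scenario relies on can be observed with a small standalone program. This is only a sketch, not code from this PR: it assumes a throwaway `httptest` TLS server instead of the real webhook, and `secondRequestReused` is a hypothetical helper name. It sends two requests through one client and uses `net/http/httptrace` to report whether the second request was served on the connection opened by the first.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// secondRequestReused (hypothetical helper) starts a throwaway TLS test
// server, sends two GET requests through one client, and reports whether the
// second request reused the connection opened by the first.
func secondRequestReused(disableKeepAlives bool) (bool, error) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()

	// srv.Client() returns a client that trusts the test server's certificate.
	client := srv.Client()
	tr, ok := client.Transport.(*http.Transport)
	if !ok {
		return false, fmt.Errorf("unexpected transport type %T", client.Transport)
	}
	tr.DisableKeepAlives = disableKeepAlives

	var reused bool
	for i := 0; i < 2; i++ {
		// GotConn fires once per request; after the loop, reused holds the
		// value recorded for the second request.
		trace := &httptrace.ClientTrace{
			GotConn: func(info httptrace.GotConnInfo) { reused = info.Reused },
		}
		req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
		if err != nil {
			return false, err
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

		resp, err := client.Do(req)
		if err != nil {
			return false, err
		}
		// Drain and close the body, as the health check does. With keep-alives
		// enabled this returns the connection to the idle pool.
		_, _ = io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return reused, nil
}

func main() {
	for _, disable := range []bool{false, true} {
		reused, err := secondRequestReused(disable)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("DisableKeepAlives=%v: second request reused the connection: %v\n", disable, reused)
	}
}
```

With keep-alives enabled, the second request should report a reused connection, so the peer certificate it sees is the one negotiated during the first handshake. With `DisableKeepAlives` set, each request should get a fresh connection and therefore a fresh handshake.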
That seems likely. IIRC certs are only used for negotiating a session key, so rotating a cert wouldn't necessarily break a pre-existing connection.
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## master #2690 +/- ##
==========================================
- Coverage 53.27% 52.72% -0.55%
==========================================
Files 120 123 +3
Lines 10594 10941 +347
==========================================
+ Hits 5644 5769 +125
- Misses 4515 4715 +200
- Partials 435 457 +22
... and 3 files with indirect coverage changes
maxsmythe
left a comment
LGTM, thank you for finding/fixing this!
Co-authored-by: Sertaç Özercan <852750+sozercan@users.noreply.github.com>
Co-authored-by: Sertaç Özercan <852750+sozercan@users.noreply.github.com> Signed-off-by: Xander Grzywinski <xandergr@microsoft.com>
What this PR does / why we need it:
The resp.Body was never closed, leaking one connection per execution.
Create a new transport based on the default transport so that it inherits the default timeouts. Most importantly, this ensures that there is a default TLSHandshakeTimeout (10s) and dial timeout (30s).
Disable HTTP keep-alives to avoid reusing the same HTTP connection. Otherwise, the check may fail when the cert is rotated (the new cert on disk wouldn't match the cert attached to the reused connection opened earlier).
Which issue(s) this PR fixes
Fixes #2654
Special notes for your reviewer: